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Abstract 

Gene duplication is considered to be a major driving force in evolution that enables the genome of a species to 
acquire new functions. A reconciliation - a mapping of gene tree vertices to the edges or vertices of a species tree - 
explains where gene duplications have occurred on the species tree. In this study, we sample reconciliations from a 
posterior over reconciliations, gene trees, edge lengths and other parameters, given a species tree and gene 
sequences. We employ a Bayesian analysis tool, based on the probabilistic model DLRS that integrates gene 
duplication, gene loss and sequence evolution under a relaxed molecular clock for substitution rates, to obtain this 
posterior. 

By applying these methods, we perform a genome-wide analysis of a nine species dataset, OPTIC, and conclude 
that for many gene families, the most parsimonious reconciliation (MPR) - a reconciliation that minimizes the 
number of duplications - is far from the correct explanation of the evolutionary history. For the given dataset, we 
observe that approximately 19% of the sampled reconciliations are different from MPR. This is in clear contrast with 
previous estimates, based on simpler models and less realistic assumptions, according to which 98% of the 
reconciliations can be expected to be identical to MPR. We also generate heatmaps showing where in the species 
trees duplications have been most frequent during the evolution of these species. 



Introduction 

Phylogenetics - traditionally a field primarily concerned 
with inferring tree-like evolution of species - has 
recently been superseded by phylogenomics - which also 
includes the evolution of genomes and their functional 
elements, in particular the genes, in relation to the spe- 
cies evolution. This genomics evolution is for many 
areas of biology, e.g., molecular biology, the final goal, 
to which the species evolution then is a means. In parti- 
cular, evolution of genes across species is a result of 
evolutionary processes such as gene duplication and 
loss, which in eukaryotes have been shown to be major 
driving forces in gene evolution. 
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Goodman et al [1] pioneered the field by introducing 
the notion of a reconciliation of the evolutionary history 
of a gene family, represented by a gene tree, with that of 
the corresponding species, represented by a species tree. 
In general, a reconciliation is a mapping of the gene tree 
vertices onto the species tree. Each internal vertex of the 
gene tree is either mapped to (1) a species tree vertex, 
which implies that the gene tree vertex represents a spe- 
ciation or (2) a species tree edge, which implies that the 
gene tree vertex represents a duplication. Goodman et 
al. used a parsimony approach and gave an algorithm 
that finds the most parsimonious reconciliation (MPR), 
i.e., the unique mapping that explains the difference 
between the gene and species trees using a minimum 
number of duplications [1]. 

Arvestad et al. [2] introduced the first probabilistic 
model for gene evolution, which explains how a gene 
family evolves inside a species tree by undergoing 
operations such as gene duplications and gene losses. 
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Later Arvestad et al. [3] proposed an integrated model 
of gene duplication, gene loss, and sequence evolution, 
under a molecular clock for estimating the posterior 
distribution over gene trees. A Markov Chain Monte 
Carlo (MCMC) based approach was used to get the 
posterior distribution over gene trees, given sequence 
data for a gene family and the corresponding species 
tree. Akerborg et al. [4] improved the model by introdu- 
cing a relaxed molecular clock for sequence evolution 
integrated with gene duplication and gene loss. This fra- 
mework efficiently computes the posterior over gene 
trees. Nevertheless, they do not suggest how to obtain 
reconciliations from the posterior distribution over gene 
trees. Rasmussen et al. [5] recently introduced another 
probabilistic approach to reconstruct gene trees inside 
the species tree. The method uses a hill-climbing-based 
approach, but it only considers MPR while computing 
the likelihood of a gene tree. They supported this 
assumption by a simulation study, where they simulated 
reconciled gene trees for the species tree using indepen- 
dently estimated duplication and loss rates [6], and 
found that 98% of all generated reconciliations were 
identical to MPR. Doyon et al. [7] reported similar 
results, and concluded that the most likely reconcilia- 
tion is either identical to MPR or very close to MPR. In 
a recent study, Doyon et al. [8] using a simple birth- 
death process and realistic but averaged gene duplica- 
tion/loss rates, found that a very small subset of all 
reconciliations needs to be explored in order to approx- 
imate the posterior probability of the most likely recon- 
ciliations. Akerborg et al. [4], on the other hand argued 
that MPR provides an incorrect explanation of the evo- 
lutionary history of gene families that have a higher 
duplication rate. 

Recently, genomes of different species have been pub- 
lished with increasingly better coverage. For instance, 
Heger et al. [9] published the Orthologous and Paralo- 
gous Transcripts in Clades (OPTIC) database, which pro- 
vides sets of gene prediction, gene families, and orthology 
assignments for clades of amniotes, vertebrates, flies, 
nematodes and yeasts. In this study, we extend the fra- 
mework by Akerborg et al. [4], for computing the poster- 
ior over gene trees, by proposing algorithms for sampling 
reconciliations as well as computing the most likely 
reconciliations on the vertebrates clade of OPTIC data- 
set. This allows us to perform a genome-wide study on 
the OPTIC dataset, in which posteriors over gene trees 
and reconciliations are estimated. We augment the spe- 
cies tree by adding a heatmap for each edge, illustrating 
how frequently duplications occur on the edge, among 
the gene families. We also compare the reconciliations 
we obtain with the most parsimonious and conclude that 
MPR leads to an incorrect reconciliation in 19% of 
all reconciliations. Finally, we propose algorithms for 



sampling and computing the most likely realizations (a 
finer reconciliation, that maps vertices of the gene tree to 
specific time points on the species tree). 

Methods 

In this section, we review the DLRS model and some 
algorithmic results from [4]. We then continue to show 
how the latter can be extended so that reconciliation 
and so-called discretized realizations can be sampled 
from the posterior distribution, as well as maximum 
aposteriori (MAP) reconciliations and realizations can 
be computed. Finally, we describe how the difference 
between two reconciliations can be quantified. 

The DLRS model and notation 

The DLRS model, proposed by Akerborg et. al. in [4], is 
based on three submodels: a duplication & loss model 
(DL), a substitution rate model (R), and a sequence evo- 
lution model (S) (see Figure 1). The duplication & loss 
submodel captures the evolution of a gene tree inside a 
species tree with given divergence times. For a tree T, 
we use V (T), E(T), and L(T) to denote the set of ver- 
tices, edges, and leaves of a tree T, respectively. Along 
an edge e e E(S) of the species tree, gene duplications 
and losses are modeled by a linear birth-death process. 
The duplication & loss process has two rates that are 
used, in the natural way, as rates for the birth-death 
process. A relaxed molecular clock is assumed for the 
substitution rate submodel. The gene tree edges have 
substitution rates, which are independently and identi- 
cally T-distributed and parameterized by a mean and a 
variance. The sequence evolution submodel, can be any 
standard sequence evolution model, e.g., JTT, which is 
the case of this study. 

We use planted binary gene and species trees, i.e., the 
trees can be obtained by adding a new vertex, a so- 
called planted root, to a rooted binary tree and making 
the planted root and the root adjacent. Moreover, let 
8 = (A, pi, m, v, M ) be the parameters of the model, 
where A is the gene duplication rate, n the gene loss 
rate, m the mean and v the variance of the distribution 
for sequence evolution rates across gene tree edges, and, 
finally, M the parameters of the sequence evolution 
model. Let T be a rooted tree and we V (T). The sub- 
tree of T rooted at u, T u , is the minimal subtree of T 
containing all descendants of u, including u. The subtree 
of T planted at u, denoted V is defined to be the sub- 
tree rooted at u, T u , together with the edge from u to 
its parent. 

MCMC estimation of DLRS posterior over gene trees 

We now describe the MCMC based framework 
employed in [4], which uses the Metropolis-Hastings 
algorithm for inference of the posterior over gene trees 
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(A) Duplication Loss Model (DL) 




(B) Substitution Rates (R) 



n 



(C) Sequence Evolution (S) 

ACTA... G A 
AGTA...GT 



Clock relaxation 

Figure 1 The three submodels of DLRS are shown. (A) Evolution of a gene lineage inside a species tree edge is modeled by a birth-death 
process. A gene lineage may come across a duplication event (represented by a green vertex), or a speciation event (represented by a black 
vertex). Every time a gene lineage passes through a speciation event, it splits into two independent gene lineages. A gene lineage may also be 
lost (represented by a red cross). After pruning all lost lineages, the final gene tree is obtained. (B) A relaxed molecular clock is employed to 
achieve branch lengths. (C) Finally, a standard sequence evolution model generates sequences over the gene tree with branch lengths. 



and rate parameters, and we also show how it can be 
extended to facilitate sampling of reconciliations from 
the posterior distribution. A state of the MCMC chain 
is a triple (G, I, 6) where the components are: a gene 
tree G with lengths /, and the parameters of the DLRS 
model 8. We use P (•) to denote a probability and p(-) to 
denote a probability density. Let (G, I, 9) be the current 
state and let (G', F, ff) be the proposed state in an itera- 
tion of the MCMC algorithm. The acceptance probabil- 
ity of the proposed state (G', I', ff) is determined by the 
ratio of the two probability densities p{G, I, 9\D, S) and 
p{G', I', 8'\D, S), where D is gene sequence data and S is 
the species tree. Since each such can be express using 
Bayes equality, e.g., 



p(G, l,9\D, S) = 



P(D\G, l)p(G,l\9,S)p(9) 



P(D\S) 

the denominators cancel and we obtain 
p (G, l,9\D, S) P (D|G, I) p (G l\9, S) p (9) 



p (G, I', 9'\D, S) P (D|G, V) p (G, l'\9', S) p (9') ' 

Here the numerator and denominator have the same 
structure, so it is sufficient to describe how to compute 
the former. First, the factor P {D\G, I) can be computed 



using the dynamic programming (DP) algorithm pro- 
posed by Felsenstein [10]. Second, the prior p{9) is cho- 
sen so that it can be easily computed. Finally, the main 
technical contribution of [4] is a DP algorithm for com- 
puting the remaining factor p(G, l\8, S), and we continue 
by outlining that approach. 

Let us first define some key concepts. Let S' be a dis- 
cretized species tree where edges of the species tree S 
have been augmented with additional discretization ver- 
tices such that all the augmented vertices are equidistant 
within an edge, see supplementary Figure 2 in additional 
file 1. 

Furthermore, we define a reconciliation to be a map- 
ping of vertices of a gene tree V (G) to the vertices and 
edges of the species tree, i.e., V (S) U E(S). A discretized 
realization, or d-realization, a, is a mapping of vertices 
of a gene tree G to the vertices of the discretized species 
tree S '. 

A realization never maps a vertex and its parent to 
same vertex x e V (5"). We consider sound reconcilia- 
tions and realizations, e.g., they never map a vertex of 
the gene tree u closer to the root than the position to 
which it maps its parental vertex, and ensures G is 
properly embedded within S. Moreover, let o{u) be the 
function defined as follows (i) for a leaf u e L(G), o(u) 
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is the species tree leaf in which the gene that u repre- 
sents can be found and (ii) for any internal vertex u of 
G, a(u) is the least common ancestor of L(G U ) in S. 

Extending the posterior to d-realizations or 
reconciliations 

In order to extend the MCMC sampling from the pos- 
terior over gene trees with lengths and parameters, i.e., 
p(G, I, 0\D, S) to sampling also over d-realizations, i.e., 
p{G, I, a, 8\D, S), it is sufficient to be able to sample 
from p(a\G, I, 6, S). This conclusion follows from 

p(C, l, a, 6 |D, S) = p (a\G, 1, 6, D, S) p (C, l,e\D, S) = p (a\G, 1, 6, S) p (G, I, e\D, S) . 



3. If x e V (S) \L(S) and x = a(u), 

s (x, x, u) I s i x ' y> v ) I I ^2 s { x > Y' w ) I ' 

\y€D L (x) J \yeD R (x) J 

where D L (x) and D R (x) are the descendants of left 
and right child of x in S', respectively. 

4. If x g V (S) and z is a child of x such that 
ct(L(G u )) c L(S^.)and z is an ancestor of y, 

, „ , , , p{l{p(u),u)/t(x,y)) 
s (x, x, u) = pn (x, z) e (x, z) — r— — rr-s (z, y, u) , 
p(l(p(u),u) t(z,y)) 



The analogous statement is true for a reconciliation, y, 
that is, if we can sample from the reconciliation poster- 
ior distribution p(y\G, I, 6, S), then we can also sample 
from the full posterior p(G, I, y, 0\D, S). 

In practice, sampling from the posterior extended with 
d-realizations, p(G, I, a, 0\D, S), is perfomed by first 
running the DLRS posterior MCMC so that k samples 
(Gi, l lt 6-l), (G k , l k , 8 k ) are obtained and, then for 
each i e [k], sample a, from pia^Gj, Ij, 6„ S). The sam- 
ples from p(G, I, a, 0\D, S) are, finally, (Gi, l it a lt 6\), 
{G h l h a h 6 k ). 

There is a unique reconciliation associated with each 
realization and the posterior probability of a reconcilia- 
tion is approximated by the sum of the posterior prob- 
abilities of the d-realizations associated with it. Thus, we 
can sample a reconciliation, from the posterior distribu- 
tion over those, by sampling a d-realization, from the 
posterior over those, and then outputting the associated 
reconciliation, which easily can be computed. So by fol- 
lowing the above described procedure and then, for 
each i e [k], computing the reconciliation y ; associated 
with a i, we obtain k samples (Gi, l x , y x , Oi), (G^ y^ 
0 k ) from the posterior distribution over reconciliations 
and other parameters. 



where s(x,z] is the probability that a gene lineage 
starting at x does not reach any leaf / e L(S' X )\L(S'J. 
However, if y = z, the expression reduces to the fol- 
lowing, 

s(x,y, u) = pii{x,y)s{x,y)p{l{p{u),u)lt{x,y))s(y,y,u) . 

5. If xe V(S')\V(S), 

s (x, x, u) = 2X I ^ 5 (x, y, v) I I ^ 5 (x, y,w) I , 

\yeDW\M / \yeD(x)\M / 

where D{x) is the set of descendants of x. 

6. If x e V {S') \ V (S) and z is the child of x in the 
discretized species tree 5', 

( \ * < p(l(p(u),u)/t(x,y)) 

s(x,y,u) = pn (x, z) — —. t — ; — s (z, y,u) . 

V ; p(l(p(u),u)/t(z,y)) V ' 

However, if y = z, the expression reduces to the fol- 
lowing, 

s{x,y,u) = pn{x,y)p(l(p(u),u)/t{x,y))s(y,y,u) . 



The generation probability and d-realization sampling 

In [4], a DP algorithm for computing the factor p(G, l\8, S) 
was described (Figure 2). The DP makes use of a table, 
s(x, y, u), defined as the probability that when a single 
gene lineage starts to evolve at the vertex xe V (5"), the 
tree G" is generated together with the edge lengths / and, 
moreover, the event corresponding to u occurs at y e ViS 1 ). 
Let v and w be children of u in G, and let x, y, z be vertices 
of V (S'). Let p(r) be the probability that an edge of G 
has rate r. Also, let t(x, y) be the time between vertices 
x, y e V (S"). The following recursions describe how the 
table s can be computed: 

1. If u g 1(G) and x = a(u), s(x, x, u) = 1. 

2. If x g V (S) and x * o{u), s(x, x, u) = 0. 



In any reconciliation or d-realization, the planted root 
of G is mapped to the planted root of S. The probability 
that the gene tree G is generated is the probability that 
when a single lineage starts at the root of S, the root of 
G occurs somewhere below the planted root of 5 and 
then the process continues and generates G. Hence, 

p(G,l\0,S) = J2 s iP'Y,r), 

yeD(p) 

where p is the planted root of 5, D(p) its descendants, 
and r is the root of G. Consequently, the probability 
that r is mapped to y e V (5') by a d-realization sampled 
from all d-realizations according to the posterior prob- 
ability distribution under observed G and /, is 
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.s(j-..r, m) = ( S (A (/.«'))( Yl ■"(■''■ !>■"')) 

yeD L (x) yeD H (x) 



(A) 



s(a(u).a(u),u) = 1 



sir. ./'. u) - I) 





„ , w -, p(/(p(»).K)A(*..y)) , 

«w0 ~ P u^m pm ^ u)/t(Zty)) Kz,y,u). 



, . , ./>('(/>(")• «)/'(*..v)) . . 
p(l(p(u),u)/t(z,y)) 




(E) 



s(.v,.v.h) = 2X 




(F) 



£ i(jT,>',H') 



Figure 2 Recursion to compute p(G, S). Shows six possible scenarios of a gene tree evolving inside the species tree and how they relate 
to dynamic programming calculations. (A) A gene tree lineage starts and reaches y, where y is a leaf of an extant species. (B) An internal vertex 
of a gene tree cannot be mapped to a speciation vertex other than the least common ancestor of its children in species tree. (C) An internal 
vertex of a gene tree may be placed on a speciation node (the least common ancestor of its children in the species tree). (D) A gene subtree 
rooted at u, starts from x (a speciation vertex), duplicates at y and yields the subtree below. (E) A gene subtree rooted at u is placed at some 
discretization point on an edge of species tree. (F) A gene lineage starts from a discretization point on an edge of species tree, and yields the 
gene subtree rooted at u. 



$ (p, y, r) 



s {p, y, r) 



P(G,1\0,S) £ zeD(p) s(p,z,r)- 

Similarly, if we know that a d-realization maps a vertex 
u e V (G) to a vertex x e V (S")> then the probability that 
a child v of u is mapped to y by a realization sampled 
from all such d-realizations, according to the posterior 
probability distribution under observed G and /, is 

s (x, y, u) 



T,zeD(x) S ( X ' Z > U )' 

This clearly provides an algorithm for sampling d-rea- 
lizations according the posterior probability distribution 
under observed G and I. 

Again, there is a unique reconciliation associated with 
each realization, and the posterior probability of a 
reconciliation is approximated by the sum of the 



posterior probabilities of the d-realizations associated 
with it. Thus, we can sample a reconciliation from the 
posterior distribution, over those, by sampling a d-reali- 
zation, from the posterior over those, and then output- 
ting the associated reconciliation. 

Computing the MAP d-realization 

We now give recursions that imply a DP algorithm for 
computing the MAP reconciliation. The following recur- 
sions describe how m can be computed, and as appar- 
ent, they are very similar to those above for s: 

1. If u e L(G) and x = (j(u), m(x, x, u) = 1. 

2. If x 6 V (S) and x * <j(u), m(x, x, u) = 0. 

3. If x e V (5) \L(S) and x = o(u), 



m (x, x, u) 



max m (x, y, v) 

yeD L (x) 



max m (x, y, w) 
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where D L (x) and D R {x) are the descendants of left 
and right child of x in S', respectively. 

4. If x £ V (S) and z is a child of x such that 
er(L(G u )) c L(S' Z ) and z is an ancestor of y, 

i \ „ , s , -s p{ 1 {p (")'") A^y)) / x 

m (x, y, u) = pn (x, z) e (x, z) 777—— — r— -. — rr-m (z, y, u) , 
p(/(p(M),u)/t(z,y)) 

where s(x,z) is the probability that a gene lineage 
starting at x does not reach any leaf / e L(S^)\L(S^.). 
However, if y = z, the expression reduces to the fol- 
lowing, 

m (x, y, m) = pn (x, y) e (x, y) p (/ (p (m) , u) /t (x, y)) m (y, y, u) . 

5. Ifxe 1/(S')W(5), 



m (x, x, u) = 2X ( max m (x, y, v) ) ( max m (x, y, w) ) , 

\yzD(x)\{x] V VyeDW\W 7 



where D(x) is the set of descendants of x. 

6. If x g V (5") \ V (S) and z is the child of # in the 

discretized species tree S", 

m (x, y, u) = p n (x, z) — (z, y, u) . 

p(l(p(u),u)/t(z,y)) 

However, if y = z, the expression reduces to the fol- 
lowing, 

m (x, y, u) = pn (x, y)p{l{p{u), u)/t(x, y))m(y, y, u). 

We now get an expression for the probability of the 
MAP d-realizations, very similar to that of p(G, l\0, S), 

p(G,l,a\6,S) m(p,y,r) 
max = max 



« p(G,l\e,S) yeD(p)p(G,l\9,S) 

When computing the probability of the MAP d- 
realizations, we can use the standard technique of back- 
pointers, i.e., keep track of the subsolution that gives the 
maximum value, and after the computation of m, trace the 
backpointers in order to find a MAP d-realization. 



3. If x e V (S) \L{S) and x = jyu), 

s (x, x,u)= I ^2 s ( x ' Y' v ) I I s ( x ' Yi w ) I ' 

\y<iD L (x)nR(y{v)) J \yaDu(x)nR(y{w)) J 

where D L (x) and D R (x) are the descendants of left 
and right child of x in S', respectively. 

4. If x g V (S) and z is a child of x such that 
cr(L(G u )) C L(S' Z ) and z is an ancestor of y, 

s (x, y, u) = p n (x, z) g (x, z) — r—r — rr-5 (z, y, u) , 

where s(x,z] is the probability that a gene lineage 
starting at x does not reach any leaf / e L(S' X )\L(S' Z ). 
However, if y = z, the expression reduces to the fol- 
lowing, 

s(x, y, u) = pn (x, y)e(x, y)p{l{p{u), u)/t(x, y))s(y, y, u). 

5. If *e 

s (x, x, u) = 2X I ^ s(x, y, v) I I ^ s{x,y,w) I , 

\y£DWnR(r(v))\(x) / \y€D(i)WnR(i(»)) / 

where is the set of descendants of x. 

6. If x e 1/ (5') \ 1/ (5) and z is the child of x in the 
discretized species tree 5', 

( \ * < p(l(p(u),u)/t(x,yj) 
s (x, y, u) = pn {x, z) — rr-s (z, y, u) . 

p(l(p(u),u)/t(z,y)) 

However, if y = z, the expression reduces to the fol- 
lowing, 

s(x, y, u) = pn (x, y)p(l(p(u), u)/t{x, y))s(y, y, u). 
Finally, 

p(G,l,y\9,S)= s iP'Y,r), 

yeD(p)nR(y(r)) 



Posterior probability of a given reconciliation 

We now give recursions for computing the posterior 
probability of a given reconciliation 7, i.e., p(G, I, y\6, S). 
The reconciliation is a mapping from V (G) to V {S) U E 
(5). Let 7? be the function from V (S) U £(5) to V (S') 
defined by (i) for x e V (5), R(x) = x and (ii) for {x, y) e 
E(S), R(x, y) is the set of internal vertices on the unique 
path between x and y in S. 

1. If u g L{G) and x = y(u), s{x, x, u) = 1. 

2. If x g V (S) and x * y(u), s(x, x, u) = 0. 



where p is the planted root of S, D(p) its descen- 
dants, and r is the root of G. 



Comparing reconciliations 

We are interested in quantifying the difference between 
two reconciliations 7 and / of G and S, in particular 
between a reconciliation we have sampled from the pos- 
terior and the MPR. To this end, we introduce two dis- 
tance measures. First, however, an atomary distance 
between objects in V (S) U E(S) is defined, so that for 
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any vertex u e V (G), the distance between y(u) and 
/(w) is well-defined. We can then compute the two 
reconciliation distance measures, namely (i) the maxi- 
mum atomary distance over vertices of G, and (ii) the 
average atomary distance over vertices of G (Figure 3). 

Assume that a, b e V (S) U E(S). Let / be the length of 
the minimum length path of S that contains both a and b, 
and let d(a, b) = / + 1 - \{a, b} n V (S)\/2 - \{a, b) n E(S)\. 
So, for instance, if a is a vertex and b is the edge to the 
parent of a, then d(a, b) = 1 + 1- 1/2-1 = 0.5, and if a = 
(x, p s (x)) (where p s (•) denotes the parent function 
in S) and b = (p s {x), p s ips (*))> then d(a, b) = 2 + 1-1- 
1 = 1. We are now ready to define our distances between 
reconciliations: the max distance is 

distance max {y , y') = max ueV {G)d[y{u), y'{u)), 

and the average distance is 

/ a E^v( G ) d {y («)>/(«)) 

distance ave (y,y ) = . 

avg V' y > \V(G)\L(G)\ 

Data 

See Supplementary Material & methods (additional file 1). 
Results 

We applied our methods to the vertebrates clade of the 
OPTIC dataset, [9], consisting of the following nine ver- 
tebrate species: Tetraodon nigroviridis (pufferfish), 
Monodelphis domestica (gray short-tailed opossum), 
Canis familiaris (dog), Mus musculus (house mouse), 
Homo sapiens (human), Ornithorhynchus anatinus 



(platypus), Taeniopygia guttata (zebra finch), Gallus gal- 
lus (red junglefowl), and Anolis carolinensis (Carolina 
anole). After basic filtering, 13812 gene families were 
selected for analysis, see supplementary Material and 
methods in additional file 1. 

For each gene family, using the MCMC-based analysis 
tool PrIME-DLRS [4] [11], a posterior distribution was 
obtained over gene trees, edge lengths and other para- 
meters, given gene sequences and the species tree. The 
expected number of duplications under the posterior dis- 
tribution, given the gene families and the species tree, 
was then estimated by sampling d-realizations and 
recording the number of duplications occurring at any 
specific discretization vertex. The number of duplications 
for all discretization vertices of the species tree were then 
normalized to 11 levels and each level was assigned a 
specific color. So, the colored heatmap illustrates how 
frequent duplications have been across the species tree. 
We also investigated enrichment of functional categories 
among the gene families with higher expected number of 
duplications over an edge. Finally, the appropriateness of 
MPR was investigated, by estimating the expected aver- 
age and maximum distance, respectively, to MPR over 
reconciliations sampled from the posterior; a few families 
for which MPR was found to be unsuitable were analyzed 
further. 

Heatmaps 

Heatmaps of the number of the duplications for the 
posterior distribution over realizations were generated, 
and provide a visualization of the duplication patterns 
across the edges of the species tree, Figure 4A. 
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(A) 



(B) 



(C) 




Figure 4 Heatmaps of duplications across the discretized edges of the species tree (A) The heatmap is generated after normalizing the 
duplications across the tree using eleven different colors. (B) The heatmap is generated after normalizing the duplications across the tree 
without including the common ancestral edge. (C) The heatmap is generated by normalizing the duplications for each edge of the species tree. 



The highest number of duplications were observed at 
the common ancestral edge of all the species. This 
could be interpreted as support of the 2R hypothesis 
proposed by Ohno [12], which suggests that the gen- 
ome of early vertebrates underwent two whole genome 
duplications. An alternative explanation could be that 
incorrect gene trees in the posterior distributions give 
rise to duplication that reconciliations tend to place 
close to the planted root. In order to test the latter, we 
performed a Maximum aposteriori (MAP) analysis 
based on only gene families with MAP gene trees hav- 
ing posterior probability grather than or equal to 0.5. 
Heatmaps based on this data, supplementary Figure 1 
(supplementary results in additional file 1), showed the 
same trend. 

The common ancestral edge of all the species except 
Puffer Fish had the second highest number of duplica- 
tions among all the edges of the species tree as shown in 
Figure 4A. In order to study the more recent lineages 
more closely, we normalized the duplications across the 
species tree without the discretization points of common 
ancestral edge, see Figure 4B. As the figure shows, the 
common ancestral edge of Human, Dog, and Mouse as 
well as the edge leading to Zebra Finch have comparatively 
higher frequencies of duplications. The higher frequencies 
of duplication on the edge leading to Boreoeutheria 



(ancestral edge of Human, Dog and Mouse) was also 
reported recently by Boussau et al [13]. 

We decided to explore the families that contribute to 
these duplications. Tools for performing enrichment 
analysis allows analysis of extant but not ancestral spe- 
cies and, moreover, when studying duplicating genes, 
choosing representative genes in extant species is com- 
plicated by the fact that the number of representatives 
can be varied. We, therefore, decided to work with gene 
families, rather than genes, and implemented this by 
using a single representative gene for each family. We 
selected the representative gene for an edge if its gene 
family were found to be likely to duplicate at least once 
on the edge. This set of genes was then annotated using 
the Functional Annotation Clustering (DAVID) [14,15] 
tool given the background of all the representative genes 
of the gene families of the dataset. For the common 
ancestral edge of Human, Dog, and Mouse, the follow- 
ing clusters had a Benjamini-Hochberg false discovery 
rate less than or equal to 0.01: ATP Binding, Chromo- 
some Segregation/Mitosis, ATPase activity coupled to 
movement of substances, Drug Metabolism, Helicase 
Activity, Mitotic Sister Chromatid Segregation, Fatty 
Acid Metabolism/Tryptophan Metabolism and Death- 
like Domain. Using the same criteria for the ancestral 
edge of Zebra Finch, gave the following clusters: Transit 
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Peptide, Mitochondrion, A TP Binding, Flavoprotein and 
Nucleotide Phosphatebinding region. 

In order to know how the frequency of duplications 
vary across each edge of the species tree, the heatmap 
was also normalized across edges (see Figure 4C). In 
most cases, there is a unimodal behavior of an indivi- 
dual edge. This may be explained by the relatively low 
number of discretization vertices and also that the signal 
in the sequence data may not be strong enough to 
reveal a more complex trend. For individual gene 
families, however, more complex trends are exhibited. 

Distance from MPR 

We computed two distances, i.e., the average distance 
and maximum distance from MPR, over the posterior 
distribution. The distribution of average distance 
between the sampled reconciliations and the MPR is 
shown in the Figure 5. The MPRs dominates the distri- 
bution of sampled reconciliations with approximately 
81% of all sampled reconciliations. We computed the 
expected distance for posterior reconciliations of indivi- 
dual gene families to MPR, in order to identify gene 
families with a clear signal for early duplications, i.e., 
early in the sense of being inferred as significantly ear- 
lier than according to MPR. 

The expected distance of a family was estimated as the 
expected average distance to MPR over reconciliations 



sampled from the posterior. The results showed that a 
number of gene families have a higher expected distance 
from the MPR, which means that MPR does not explain 
the true evolutionary history well in those cases. About 
13% of all families had expected maximum distance 
equal to or greater than 0.5. The distribution of the 
expected maximum distance and the expected average 
distance of the gene families from the MPR are shown 
in Figure 6. 

We selected four gene families for further analysis. They 
had a clear signal for early duplications and at least one 
gene from every species of the dataset. One of the four 
selected gene families was Short chain dehydrogenase. It 
has a clear signal in favor of non-MPR reconciliations as 
shown in Figure 7. Most of the reconciliations sampled for 
this family had average distances between 0.5 and 0.6, 
comprising around 74% of all sampled reconciliations. For 
this gene family, not a single reconciliation sampled was 
identical to the MPR. This gene family is annotated as 
steroid hormone biosynthesis. 

Conclusion 

We have presented methods for sampling and computing 
MAP reconciliations as well as d-realizations. Using these 
methods and the OPTIC dataset, we have provided the 
first biologically realistic estimate of the appropriateness 
of MPR. It was found that one can expect approximately 
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Figure 5 Average distance from MPR across gene families. X-axis represents average distance of a reconciliation from the MPR. Y-axis 
represents the percentage of count of reconciliations having a certain average distance from the MPR 
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Figure 6 Expected distances from the MPR for the 13812 analyzed gene families. (A) Distribution of the expected maximum distance to 
MPR. (B) Distribution of the expected average distance to MPR.. 



i 




Figure 7 Average distance from MPR for the gene family Short chain dehydrogenase (A) Distribution of the average distance to MPR over 
10000 sampled reconciliations. (B) The pie chart shows the shares of reconciliations having a certain average distance from the MPR. The labels 
shows the distance from the MPR. 



19% of reconciliations to be different from MPR. Also, 
13% of gene families can be expected to have a maximum 
distance of greater than or equal to 0.5 to the MPR. 
Among other reasons, this is interesting because some 
gene tree reconstuction algorithms evaluate gene trees 
using only MPR. We have also shown how, based on our 
methods, heatmaps can be constructed that illustrates 
how frequent duplications are across the species tree and 
that for vertebrates such a strategy identifies two recent 
edges as having hosted frequent duplications. Finally, 
enrichment analysis can identify functional classes 
among gene families that are duplicated on a specific spe- 
cies tree edge. 

Additional material 



Additional file 1: (PDF) 
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